{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Reference: https://jupyterbook.org/interactive/hiding.html\n",
    "# Use {hide, remove}-{input, output, cell} tags to hide content\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "import ipywidgets as widgets\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "from IPython.display import display\n",
    "\n",
    "sns.set()\n",
    "sns.set_context('talk')\n",
    "np.set_printoptions(threshold=20, precision=2, suppress=True)\n",
    "pd.set_option('display.max_rows', 7)\n",
    "pd.set_option('display.max_columns', 8)\n",
    "pd.set_option('display.precision', 2)\n",
    "# This option stops scientific notation for pandas\n",
    "# pd.set_option('display.float_format', '{:.2f}'.format)\n",
    "\n",
    "def display_df(df, rows=pd.options.display.max_rows,\n",
    "               cols=pd.options.display.max_columns):\n",
    "    with pd.option_context('display.max_rows', rows,\n",
    "                           'display.max_columns', cols):\n",
    "        display(df)"
   ],
   "outputs": [],
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Exercises"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Use the `baby` data to create a plot of how popular your name was over time.\n",
    "  If you used that plot to make a guess at your age, what would you guess? Is\n",
    "  that close to your actual age? Think of a potential reason."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In this chapter we talked about how to use `.loc` and `.iloc` for slicing.\n",
    "  We've also shown a few shorthands. Convert each of these shorthand code\n",
    "  snippets to the equivalent code that uses `.loc` or `.iloc`.\n",
    "\n",
    "  ```python\n",
    "  baby['Name']\n",
    "  ```\n",
    "\n",
    "  ```python\n",
    "  baby[0:5]\n",
    "  ```\n",
    "\n",
    "  ```python\n",
    "  baby[['Name', 'Count']]\n",
    "  ```\n",
    "\n",
    "  ```python\n",
    "  baby[baby['Count'] < 10]\n",
    "  ```"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- What's the difference between running:\n",
    "  \n",
    "  ```python\n",
    "  baby['Name']\n",
    "  ```\n",
    "  \n",
    "  and:\n",
    "\n",
    "  ```python\n",
    "  baby[['Name']]\n",
    "  ```\n",
    "\n",
    "  And why does this code work:\n",
    "\n",
    "  ```python\n",
    "  baby[['Name']].iloc[0:5, 0]\n",
    "  ```\n",
    "\n",
    "  but this code raises an error?\n",
    "\n",
    "  ```python\n",
    "  baby['Name'].iloc[0:5, 0]\n",
    "  ```"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- The first code snippet below makes a dataframe with 6 rows, but the second\n",
    "  makes a dataframe with 5 rows. Why?\n",
    "  \n",
    "  ```python\n",
    "  baby.loc[0:5]\n",
    "  ```\n",
    "\n",
    "  ```python\n",
    "  baby.iloc[0:5]\n",
    "  ```"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- When plotting male and female baby names over time, you might notice that\n",
    "  after 1950 there are generally more male babies. Is this trend reflected in\n",
    "  the U.S. census data? Go to the Census website\n",
    "  (https://data.census.gov/cedsci/) and check."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Find the five names with the highest standard deviation of yearly counts.\n",
    "  What might a large standard deviation tell you about the popularity of these\n",
    "  names over time?"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Find the five names with the highest *interquartile range* of yearly counts.\n",
    "  The interquartile range is the 75th percentile minus the 25th percentile of\n",
    "  the data. You may find the `pd.Series.quantile()` function useful ([link to\n",
    "  documentation][quantile]). Are these names different from the names with the\n",
    "  highest standard deviation? Why might this happen?\n",
    "\n",
    "[quantile]: https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "\n",
    "- We've shown this syntax for grouping:\n",
    "\n",
    "  ```python\n",
    "  baby.groupby('Year')['Count'].sum()\n",
    "  ```\n",
    "\n",
    "  This code does the same thing:\n",
    "\n",
    "  ```python\n",
    "  baby.groupby(baby['Year'])['Count'].sum()\n",
    "  ```\n",
    "\n",
    "  The second syntax passes a `pd.Series` into `.groupby()`. It's a bit more\n",
    "  verbose but also gives more flexibility. Why is this syntax more flexible?\n",
    "\n",
    "  Hint: What does this code do?\n",
    "\n",
    "  ```python\n",
    "  baby.groupby(baby['Year'] // 10 * 10)['Count'].sum()\n",
    "  ```"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Let's say you want to find the most popular male and female baby name each\n",
    "  year. You might write this:\n",
    "\n",
    "  ```python\n",
    "  (baby\n",
    "   .groupby(['Year', 'Sex'])\n",
    "   [['Count', 'Name']]\n",
    "   .max()\n",
    "  )\n",
    "  ```\n",
    "\n",
    "  But this code doesn't produce the right result. Why?\n",
    "\n",
    "  Now, write code to produce the most popular male and female name each year,\n",
    "  along with its count. *Hint:* you can make use of the fact that within each\n",
    "  year and birth sex, the names are sorted in descending order of popularity."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Come up with a realistic data example where a data scientist would prefer an\n",
    "  inner join to a left join, and an example where a data scientist would\n",
    "  prefer a left join to an inner join."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In the section on Joins, the `nyt` table doesn't have any duplicate names.\n",
    "  But a name could feasibly belong to multiple categories---for instance,\n",
    "  `Elizabeth` is a name from the Bible and a name for royalty. Let's say we\n",
    "  have a dataframe called `multi_cat` that can list a name multiple\n",
    "  times---once for each category it belongs to:"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "source": [
    "multi_cat = pd.DataFrame([\n",
    "    ['Elizabeth', 'bible'],\n",
    "    ['Elizabeth', 'royal'],\n",
    "    ['Arjun', 'hindu'],\n",
    "    ['Arjun', 'mythological'],\n",
    "], columns=['nyt_name', 'category'])\n",
    "multi_cat"
   ],
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nyt_name</th>\n",
       "      <th>category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Elizabeth</td>\n",
       "      <td>bible</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Elizabeth</td>\n",
       "      <td>royal</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Arjun</td>\n",
       "      <td>hindu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Arjun</td>\n",
       "      <td>mythological</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    nyt_name      category\n",
       "0  Elizabeth         bible\n",
       "1  Elizabeth         royal\n",
       "2      Arjun         hindu\n",
       "3      Arjun  mythological"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "What happens when we join `baby` with this table? In general, what happens when\n",
    "there are *multiple rows* that match in both left and right tables?"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In a *self-join*, we take a table and join it with itself. For example, the\n",
    "  `friends` table contains pairs of people who are friends with each other."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "source": [
    "friends = pd.DataFrame([\n",
    "    ['Jim', 'Scott'],\n",
    "    ['Scott', 'Philip'],\n",
    "    ['Philip', 'Tricia'],\n",
    "    ['Philip', 'Ailie'],\n",
    "], columns=['self', 'other'])\n",
    "friends"
   ],
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>self</th>\n",
       "      <th>other</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jim</td>\n",
       "      <td>Scott</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Philip</td>\n",
       "      <td>Tricia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Philip</td>\n",
       "      <td>Ailie</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     self   other\n",
       "0     Jim   Scott\n",
       "1   Scott  Philip\n",
       "2  Philip  Tricia\n",
       "3  Philip   Ailie"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Why might a data scientist find the following self-join useful?"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "source": [
    "friends.merge(friends, left_on='other', right_on='self')"
   ],
   "outputs": [
    {
     "output_type": "display_data",
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>self_x</th>\n",
       "      <th>other_x</th>\n",
       "      <th>self_y</th>\n",
       "      <th>other_y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jim</td>\n",
       "      <td>Scott</td>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Tricia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Scott</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Ailie</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  self_x other_x  self_y other_y\n",
       "0    Jim   Scott   Scott  Philip\n",
       "1  Scott  Philip  Philip  Tricia\n",
       "2  Scott  Philip  Philip   Ailie"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- Have names become longer on average over time? Produce a plot to answer this\n",
    "  question."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "- In this chapter we found that you could make reasonable guesses at a person's\n",
    "  age just by knowing their name. For instance, the name \"Luna\" has sharply\n",
    "  risen in popularity since 2000, so you could guess that a person named \"Luna\"\n",
    "  was born after 2000. Can you make reasonable guesses at a person's age\n",
    "  just from the *first letter* of their name? Write code to see whether this is\n",
    "  possible, and which first letters provide the most information about a\n",
    "  person's age."
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}